Initial integration of TUF client into Nexus #469
Conversation
Nexus is async; tough is not async, and tough uses reqwest in blocking mode, which must not be used from an async runtime. From https://docs.rs/reqwest/0.11.7/reqwest/blocking/index.html:
> Conversely, the functionality in `reqwest::blocking` must not be
> executed within an async runtime, or it will panic when attempting to
> block.
This moves the relevant code out to updates.rs, since the function can't borrow self due to lifetime constraints.
Awesome start, super cool to see this coming together!
I think your judicious use of spawn_blocking to decouple from the tough async constraints was a good call - the rest of the PR looks mostly unaffected by that part of the codebase.
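For readers new to the pattern, here is a minimal sketch of that boundary; the function name and types are placeholders standing in for the blocking tough-facing code in updates.rs, not the actual signatures:

use tokio::task;

// Placeholder for the blocking tough / reqwest::blocking work in updates.rs.
fn read_artifacts_blocking() -> Result<Vec<String>, String> {
    // ... all tough and reqwest::blocking calls would live here ...
    Ok(Vec::new())
}

async fn refresh_metadata() -> Result<Vec<String>, String> {
    // The closure runs on the blocking thread pool, so reqwest::blocking
    // never executes inside the async runtime.
    let artifacts = task::spawn_blocking(read_artifacts_blocking)
        .await
        .expect("read_artifacts panicked")?; // JoinError only if the closure panicked
    Ok(artifacts)
}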
[updates]
# If not present, accessing the TUF updates repository will fail
#tuf_trusted_root = "/opt/oxide/nexus/pkg/root.json"
Where would this root.json come from? I'm noticing this part of Nexus config present-but-commented-out everywhere, sorta wondering what the plan is gonna be for getting this running in a dev/test environment.
I should probably write up a script for generating one of these, but copying and pasting from history:
tuftool root init 1.root.json
tuftool root gen-rsa-key 1.root.json ~/test.key -r root snapshot targets timestamp
[ manually edit a few things in 1.root.json, namely expiration and key thresholds ]
tuftool root sign -k ~/test.key 1.root.json
then creating a repository from the files in ~/in:
tuftool create -k ~/test.key --outdir repo --root 1.root.json \
--snapshot-expires 'in 3 days' --targets-expires 'in 3 days' --timestamp-expires 'in 3 days' \
--snapshot-version $(date +%s) --targets-version $(date +%s) --timestamp-version $(date +%s) \
--add-targets ~/in
/**
 * Refresh update metadata
 */
This seems like a good candidate for an addition to https://github.com/oxidecomputer/omicron/blob/main/tools/oxapi_demo , as a convenience
}]
async fn updates_refresh(
    rqctx: Arc<RequestContext<Arc<ServerContext>>>,
    // _query_params: Query<PaginatedById>,
Dead code?
nexus/src/nexus.rs
// first, the JoinError:
.map_err(|e| Error::InternalError { internal_message: e.to_string() })?
This only happens when the call to read_artifacts panics, right?
- I think we're using panic=abort, so this shouldn't happen,
- ... if it did, panicking ASAP seems like the right behavior.
In other codebases, I just unwrapped the first error for this reason: https://github.com/oxidecomputer/async-bb8-diesel/blob/09f17db081be85abdf4f773a385381637b9c8d76/src/connection.rs#L42-L45
nexus/src/nexus.rs
.map_err(|e| Error::InternalError { internal_message: e.to_string() })?
// next, the boxed dyn Error:
.map_err(|e| Error::InternalError {
    internal_message: e.to_string(),
Nit: Could we add a little context here - just something like "we were trying to refresh updates, and saw {}"
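Something like the following would cover both the JoinError question above and this nit; a drop-in sketch for the chain quoted above, with the message wording only an example:

.await
// a JoinError only means read_artifacts panicked, so just propagate the panic
.unwrap()
// and give the real error some context on the way out
.map_err(|e| Error::InternalError {
    internal_message: format!("failed to refresh TUF update metadata: {}", e),
})?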
// FIXME: if we hit an error in any of these database calls, the available artifact table
// will be out of sync with the current artifacts.json. can we do a transaction or
// something?
> can we do a transaction or something
Yeah, there's an example here: https://github.com/oxidecomputer/async-bb8-diesel/blob/22c26ef840075a57b18ac2eae3ead989b992773e/examples/usage.rs#L72-L82
If we wanted to do this, we would probably have a call to self.db_datastore that takes all the artifacts we wanna upsert at once, so the transaction could be self-contained in the datastore function.
All that being said... the upserts are idempotent, right? We could have a separate single-column table for targets_role_version, which we update once all the necessary artifacts are updated. Then the steps would be:
- upsert all the artifacts
- update the "desired" targets role version
- delete all older artifacts
And I don't think they'd need to be transactional anymore.
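A rough sketch of those steps, using hypothetical datastore method names (only update_available_artifact_hard_delete_outdated appears in this PR; the other two are made up for illustration):

// 1. Upsert every artifact listed in the freshly fetched artifacts.json.
for artifact in &artifacts {
    self.db_datastore.update_available_artifact_upsert(artifact.clone()).await?;
}
// 2. Once they're all in place, record the new "desired" targets role version.
self.db_datastore.update_desired_targets_role_version(targets_role_version).await?;
// 3. Finally, delete artifacts belonging to older targets role versions.
self.db_datastore
    .update_available_artifact_hard_delete_outdated(targets_role_version)
    .await?;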
Either way, if you wanna punt on this until after the PR is merged, could we file a bug to track? DB concurrency shenanigans are hairy, but I don't want us to lose track of 'em.
How is this function called? If it's from an API endpoint, and the API request fails, then at least the caller knows it didn't work and we're basically saying it's their responsibility to retry until it works.
Is it also a problem that it might be inconsistent? That is, we might have updated some of the artifacts but not all of them?
If it's inconsistent, then either Nexus will be missing information about a newer update, or might try to update to a removed artifact (if we ever have to do that for some reason).
tuf_metadata_base_url: "http://localhost:8000/metadata".to_string(),
tuf_targets_base_url: "http://localhost:8000/targets".to_string(),
These might also be good candidates for the Nexus config file?
Also, should we have instructions on "how to set this up"? I don't think these servers are running on most people's machines.
I've just been running python3 -m http.server in the output directory from tuftool. (Instructions for making a repo probably should go along with that.)
These should probably live in the Nexus config file to start, but they eventually need to be configurable by an operator... thoughts?
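If they did move into the config file, the [updates] struct might grow fields along these lines; a sketch only, and the URL field names are assumptions rather than the actual code:

use std::path::PathBuf;

pub struct UpdatesConfig {
    /// Trusted root.json role for the TUF updates repository.
    pub tuf_trusted_root: Option<PathBuf>,
    /// Base URL for TUF metadata (hypothetical field).
    pub tuf_metadata_base_url: Option<String>,
    /// Base URL for TUF targets (hypothetical field).
    pub tuf_targets_base_url: Option<String>,
}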
What's the longer term plan here? Are we building a new component to serve this endpoint? Is it part of the control plane? If not, maybe we want to treat this like we do Clickhouse and CockroachDB, with easy ways to run it by hand that we use in the test suite to get coverage of this stuff?
The longer-term plan is for this to default to something.oxide.computer via HTTPS and be stored in the database, ready for an end user to change it if they need to.
let mut current_version = None;
for artifact in artifacts {
    current_version = Some(artifact.targets_role_version);
As a sanity-check -- if we see multiple values of artifact.targets_role_version, they should all be the same, right? (trying to mentally verify that the order of artifacts couldn't change behavior)
Yeah, these will all be the same.
This is kind of messy, it's a last minute "well crap I need this value" that we can clean up when moving the upsert and delete operations into the same transaction.
    .update_available_artifact_hard_delete_outdated(current_version)
    .await?;
}
When do you think we'd actually try to apply updates (basically, making the calls to POST /update on each sled) to the rest of the rack? Would that be at the end of this function, or some other time?
For the demo, probably immediately just for simplicity's sake?
At release I think a background task that periodically checks for out-of-date artifacts on the rack and starts applying updates is probably correct. Otherwise we need to handle applying updates here as well as when a sled-agent comes online (e.g. brand new sled, or sled that's been powered off for a few weeks).
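That could be as simple as a periodic loop; a sketch of the idea only, with the interval, handle, and method name all placeholders rather than a design commitment:

// Spawn a background task that periodically checks for out-of-date artifacts
// on the rack and starts applying updates.
tokio::spawn(async move {
    let mut ticker = tokio::time::interval(std::time::Duration::from_secs(300));
    loop {
        ticker.tick().await;
        if let Err(e) = nexus.check_and_apply_updates().await {
            // Log and try again on the next tick rather than bailing out.
            eprintln!("periodic update check failed: {}", e);
        }
    }
});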
    pub name: String,
    pub version: i64,
    pub kind: UpdateArtifactKind,
    pub target: String,
(... I feel like I may have asked you this before, but I can't remember the answer)
What is this field? Is it a reference to the targets.json file somehow? If it's the name of the artifact, how is it different from name?
Related: Could we add some docs?
It's the name of the target in targets.json, yeah. I should do a docs / "is this name sensible" pass again.
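For that docs pass, the fields might read something like this; the struct name is a placeholder and the descriptions are inferred from this thread and the PR description, not copied from the code:

pub struct UpdateAvailableArtifact {
    /// Artifact name, as listed in artifacts.json.
    pub name: String,
    /// Artifact version (an integer, as listed in artifacts.json).
    pub version: i64,
    /// What kind of component this artifact updates (e.g. a zone).
    pub kind: UpdateArtifactKind,
    /// Name of the corresponding target in the TUF targets.json role.
    pub target: String,
}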
As Sean said, this looks like a great start!
nexus/examples/config-file.toml
mode = "file"
path = "logs/server.log"
if_exists = "append"
#mode = "file"
This file was supposed to be exactly the same as nexus/examples/config.toml, except it logs to a file instead of stdout. I think maybe you didn't mean to comment out this section?
pub struct UpdatesConfig {
    /** Trusted root.json role for the TUF updates repository. If `None`, accessing the TUF
     * repository will fail. */
    pub tuf_trusted_root: Option<PathBuf>,
Is there any deployment where we might want to leave this out? (i.e., why not require this? is it a pain for development?)
I imagine this could be temporary, but given all the work going on right now I didn't want to break folks' testing workflows. There should not be production deployments where this is somehow left unconfigured.
nexus/src/config.rs
@@ -330,6 +340,7 @@ mod test {
    .parse()
    .unwrap()
},
updates: UpdatesConfig { tuf_trusted_root: None },
I think this test fills everything out instead of leaving anything optional. It seems useful to have at least one test that does that for the updates section.
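That fully specified variant could be as small as the following (the path is a placeholder, not a real file):

updates: UpdatesConfig {
    tuf_trusted_root: Some(PathBuf::from("/nonexistent/path/to/root.json")),
},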
@@ -142,8 +144,8 @@ impl Nexus {
 * Create a new Nexus instance for the given rack id `rack_id`
 */
/* TODO-polish revisit rack metadata */
pub fn new_with_id(
    rack_id: &Uuid,
pub async fn new_with_id(
Wondering out loud: I've been trying to keep this sync because it's tempting to put async things in here that belong as background processes (e.g., saga recovery or database initialization). I don't fully grok the flow of the root JSON file but it looks like this is reasonable to be done immediately. But I wonder if the (async) caller should pass in the file contents instead?
We could also keep it sync and just use std::fs::read. Root roles are generally expected to be very small (hence why we're just keeping it in memory).
Alternatively we can just pass along the PathBuf of the root to Nexus, and Nexus can read it whenever /updates/refresh is hit.
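A minimal sketch of the synchronous option, assuming Nexus just keeps the root role bytes in memory (function and variable names are placeholders):

use std::path::PathBuf;

// Read the (small) trusted root synchronously while constructing Nexus.
// Returns Ok(None) when no root is configured, in which case a later
// /updates/refresh call would fail, as the config comment describes.
fn load_trusted_root(path: Option<&PathBuf>) -> std::io::Result<Option<Vec<u8>>> {
    path.map(std::fs::read).transpose()
}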
...since we only get a JoinError if the task we're waiting on panics.
RFD 183 recommends the use of TUF as a transport mechanism to get updates from Oxide to the rack for a few reasons, one of which is "I wrote a TUF client once already". This is the initial integration work to add tough to Nexus.

I am now, of course, paying the costs of decisions I made two years ago. tough is not yet async (since it was written right as async/.await was starting to stabilize), RepositoryBuilder and Repository are not Send, and tough has a defensively-built API, intended for systems that have few mechanisms to handle threats to an update system. So there's this awkward tokio::task::spawn_blocking boundary between tough and the rest of Nexus.

Note the update URLs are currently hardcoded to paths on http://localhost:8000 because Nexus doesn't fetch the Rack from the database yet.

Quick top-level overview:

artifacts.json is a target in the TUF repository we build. It describes updates available for a specific "thing" (in this case, the not-yet-existing CockroachDB zone), an integer version number, what kind of thing it is (in this case, a zone), and the name of the target as listed in the targets.json TUF role.

This code adds an /updates/refresh endpoint to more-or-less copy the data from this file (and the target's checksum/size) into a database table.

The goal is for Nexus to be aware of what updates are available for different "things", and to be able to download them while adhering to the TUF specification (not fetching if the metadata is expired, and checking the checksum and size), but without having to re-fetch the metadata every time it wants to fetch an artifact.

(I am going to work on artifact fetching next, and will either append it to this PR or put it in a follow-up PR.)